Diabetes is a massive problem in American society. Though people can live with it, they must be very diligent in managing their health, and cases are rising in the United States at an alarming rate. According to the CDC, diabetes cases have risen to an estimated 34.2 million (https://www.diabetesresearch.org/file/national-diabetes-statistics-report-2020.pdf).
Our main objective is to analyse a dataset of attributes related to diabetes and to predict whether a given person is diabetic. We will apply machine learning algorithms to achieve this goal.
These are the libraries used in this tutorial.
import requests
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plot
import seaborn as sns
from sklearn import svm,tree
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import make_classification
from scipy import stats
In the data collection stage of the data life cycle, the focus is on gathering data from sources such as websites and databases.
We have found data from Kaggle at: https://www.kaggle.com/datasets/mathchi/diabetes-data-set
This data has all the attributes needed for predicting diabetes, which will aid us in reaching our final goal. The subjects are women at least 21 years old of Pima Indian heritage. The data is organized with the following attributes, as described by the website:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
attributes = pd.read_csv("diabetes.csv")
attributes.head(10)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
As you can see from the data shown above, each individual has a numeric id and every attribute filled in.
Outcome, the most important piece of data, is the last column.
attributes.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
As we can see, there are 768 individual data points. Some attributes vary a great deal: skin thickness, for example, ranges from 23 at the 50th percentile to 99 at the maximum.
Also of note, there are no abnormal values beyond what is expected in diabetics; for example, the maximum blood pressure of 122 is within a normal range and isn't considered harmful.
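Before cleaning, it is worth counting the zeros directly, since columns like Glucose and BMI cannot physically be zero and zeros there most likely encode missing values. A minimal sketch, using a toy frame in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the diabetes frame: Glucose and BMI cannot be zero
# in a living patient, so zeros there likely mean "not recorded".
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 0.0, 23.3, 28.1],
    "Pregnancies": [6, 1, 8, 0],  # zero is legitimate for this column
})

# Count suspicious zeros only in columns where zero cannot occur.
suspect_cols = ["Glucose", "BMI"]
zero_counts = (df[suspect_cols] == 0).sum()
print(zero_counts)
```

On the real frame, the same two lines flag every column that needs the zero-to-NaN treatment performed later in this tutorial.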
Now that we have gathered our data, we need to make it usable for analysis. We will need to make a number of changes to our data and perform data tidying. In our case the dataset was selected to be exactly what we need to predict diabetic cases, so we will not need to remove rows or categories. However, there is a good amount of missing values that we will need to take care of.
Let's first take a look at the data.
positive = attributes[attributes['Outcome']==1]
negative = attributes[attributes['Outcome']!=1]
positive.head(10)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
| 11 | 10 | 168 | 74 | 0 | 0 | 38.0 | 0.537 | 34 | 1 |
| 13 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | 1 |
| 14 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 | 1 |
| 15 | 7 | 100 | 0 | 0 | 0 | 30.0 | 0.484 | 32 | 1 |
z = np.abs(stats.zscore(attributes))
print(z)
[[0.63994726 0.84832379 0.14964075 ... 0.46849198 1.4259954  1.36589591]
 [0.84488505 1.12339636 0.16054575 ... 0.36506078 0.19067191 0.73212021]
 [1.23388019 1.94372388 0.26394125 ... 0.60439732 0.10558415 1.36589591]
 ...
 [0.3429808  0.00330087 0.14964075 ... 0.68519336 0.27575966 0.73212021]
 [0.84488505 0.1597866  0.47073225 ... 0.37110101 1.17073215 1.36589591]
 [0.84488505 0.8730192  0.04624525 ... 0.47378505 0.87137393 0.73212021]]
positive.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.0 |
| mean | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.550500 | 37.067164 | 1.0 |
| std | 3.741239 | 31.939622 | 21.491812 | 17.679711 | 138.689125 | 7.262967 | 0.372354 | 10.968254 | 0.0 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.088000 | 21.000000 | 1.0 |
| 25% | 1.750000 | 119.000000 | 66.000000 | 0.000000 | 0.000000 | 30.800000 | 0.262500 | 28.000000 | 1.0 |
| 50% | 4.000000 | 140.000000 | 74.000000 | 27.000000 | 0.000000 | 34.250000 | 0.449000 | 36.000000 | 1.0 |
| 75% | 8.000000 | 167.000000 | 82.000000 | 36.000000 | 167.250000 | 38.775000 | 0.728000 | 44.000000 | 1.0 |
| max | 17.000000 | 199.000000 | 114.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 70.000000 | 1.0 |
negative.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 500.000000 | 500.0000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.0 |
| mean | 3.298000 | 109.9800 | 68.184000 | 19.664000 | 68.792000 | 30.304200 | 0.429734 | 31.190000 | 0.0 |
| std | 3.017185 | 26.1412 | 18.063075 | 14.889947 | 98.865289 | 7.689855 | 0.299085 | 11.667655 | 0.0 |
| min | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.0 |
| 25% | 1.000000 | 93.0000 | 62.000000 | 0.000000 | 0.000000 | 25.400000 | 0.229750 | 23.000000 | 0.0 |
| 50% | 2.000000 | 107.0000 | 70.000000 | 21.000000 | 39.000000 | 30.050000 | 0.336000 | 27.000000 | 0.0 |
| 75% | 5.000000 | 125.0000 | 78.000000 | 31.000000 | 105.000000 | 35.300000 | 0.561750 | 37.000000 | 0.0 |
| max | 13.000000 | 197.0000 | 122.000000 | 60.000000 | 744.000000 | 57.300000 | 2.329000 | 81.000000 | 0.0 |
def tar_sum(data, col, val):
return (data[col] == val).sum()
possum = tar_sum(attributes,"Outcome",1.0)
negsum = tar_sum(attributes,"Outcome",0.0)
total = possum + negsum
print("number of positive outcomes:")
print(possum)
print("percentage of positive outcomes:")
posperc= possum / total *100
print(posperc)
print()
print("number of negative outcomes:")
print(negsum)
print("percentage of negative outcomes:")
negperc = negsum / total *100
print(negperc)
number of positive outcomes:
268
percentage of positive outcomes:
34.89583333333333

number of negative outcomes:
500
percentage of negative outcomes:
65.10416666666666
As we can see, there is a good split among the people in this data. There are 268 people with confirmed cases of diabetes and 500 without. It is approximately a 35% positive, 65% negative split.
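One consequence of this split is worth keeping in mind for the modeling stage: a trivial classifier that always predicts "no diabetes" would already be right about 65% of the time, so that is the floor any real model must beat. A quick sketch of this majority-class baseline:

```python
# Majority-class baseline: always predicting the most common outcome.
# With 500 negatives and 268 positives, this trivial "model" is right
# 500/768 of the time -- the floor any real classifier must beat.
negatives, positives = 500, 268
baseline_accuracy = negatives / (negatives + positives)
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported later in this tutorial should be judged against this baseline, not against 50%.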
As we can see from the descriptions of the positive and negative data, there are very big differences between them. This shows that we cannot rely on the combined data to predict diabetic cases, and we will have to analyse the two subsets separately. We will visualize and look through the dataset with a more focused eye later.
The z-score indicates where each point lies in a distribution. It uses the mean and standard deviation to show how far from the mean the point is. The z-scores we observed for the data above mostly fall between -3 and 3, so we will use the range -3 to 3 as the cutoff for outliers.
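As a quick sanity check on the formula, the z-score of each point is (x - mean) / standard deviation; a hand computation on toy numbers should match scipy.stats.zscore:

```python
import numpy as np
from scipy import stats

# Hand computation of the z-score: z = (x - mean) / std.
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z_manual = (x - x.mean()) / x.std()  # population std, scipy's default
z_scipy = stats.zscore(x)
print(z_manual)
```

The middle value sits exactly at the mean, so its z-score is 0; points one standard deviation away get z-scores of ±1, and so on.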
attributes = attributes[(z < 3).all(axis=1)]
#attributes.describe()
def tar_sum(data, col, val):
return (data[col] == val).sum()
possum = tar_sum(attributes,"Outcome",1.0)
negsum = tar_sum(attributes,"Outcome",0.0)
total = possum + negsum
print("number of positive outcomes:")
print(possum)
print("percentage of positive outcomes:")
posperc= possum / total *100
print(posperc)
print()
print("number of negative outcomes:")
print(negsum)
print("percentage of negative outcomes:")
negperc = negsum / total *100
print(negperc)
attributes.info()
number of positive outcomes:
227
percentage of positive outcomes:
32.99418604651162

number of negative outcomes:
461
percentage of negative outcomes:
67.00581395348837

<class 'pandas.core.frame.DataFrame'>
Int64Index: 688 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               688 non-null    int64
 1   Glucose                   688 non-null    int64
 2   BloodPressure             688 non-null    int64
 3   SkinThickness             688 non-null    int64
 4   Insulin                   688 non-null    int64
 5   BMI                       688 non-null    float64
 6   DiabetesPedigreeFunction  688 non-null    float64
 7   Age                       688 non-null    int64
 8   Outcome                   688 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 53.8 KB
As you can see, this has reduced the data by quite a bit; however, the positive/negative ratios are very similar to where they were before.
When looking at our dataframe, we saw that some features contain 0s where a zero value makes no physical sense. We will replace these values with NaN.
attributes[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = attributes[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.NaN)
def plot_missing(data, key):
null_feat = pd.DataFrame(len(data[key]) - data.isnull().sum(), columns = ['Count'])
percentage_null = pd.DataFrame((len(data[key]) - (len(data[key]) - data.isnull().sum()))/len(data[key])*100, columns = ['Percent'])
percentage_null = percentage_null.round(2)
result = pd.concat([null_feat, percentage_null], axis=1, join='inner')
result.plot(y='Percent',use_index=True, kind = 'bar')
plot.title('Missing data percentage per attribute')
plot.ylabel('Percent')
plot.xlabel('Attribute')
return result
percent_missing = plot_missing(attributes, 'Outcome')
Using the boxplot below, we can see that missing values are no longer 0s; they are now encoded as NaN.
plot.style.use('ggplot')
f, ax = plot.subplots(figsize=(11, 15))
ax.set(xlim=(-1, 200))
plot.ylabel('Variables')
plot.title("General Overview Data Set")
ax = sns.boxplot(data = attributes, orient = 'v', palette = 'Set2')
We now have a number of missing values that we need to fill. The missing data doesn't seem truly random, as it isn't evenly spread among the attributes that have missing values. There isn't enough information on how this data was recorded, so we cannot assume anything out of the ordinary, and we will treat the values as missing at random (MAR).
There are many different imputation strategies, but we will use median imputation (similar to mean imputation), since we have seen how varied this data can be and how likely outliers are.
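To see why the median is the safer choice here, consider a toy series with one Insulin-sized outlier: the mean fill is dragged far above the typical values, while the median fill is not. A sketch (toy numbers, not the real data):

```python
import numpy as np
import pandas as pd

# One Insulin-sized outlier (846 is the real column's max) drags the
# mean far above the typical values; the median barely moves.
s = pd.Series([15.0, 20.0, 25.0, np.nan, 846.0])
mean_fill = s.fillna(s.mean())      # NaN becomes 226.5
median_fill = s.fillna(s.median())  # NaN becomes 22.5
print(s.mean(), s.median())
```

Filling with 226.5 would invent a value larger than every non-outlier observation; filling with 22.5 stays inside the typical range.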
This is the part of the pipeline called exploratory data analysis (EDA). At this stage we want to observe any possible trends; we can apply statistical analysis to better support our observations and find evidence for the trends found.
In our case it is convenient to finish imputing variables while visualizing and doing EDA, so we will do that here as well. We will look at the positive and negative results separately at this point, rather than the combined data, since we are looking for differences between the two groups and need the sets to be separate.
positive = attributes[attributes['Outcome']==1]
negative = attributes[attributes['Outcome']!=1]
def median_found(dataset, var):
temp = dataset[dataset[var].notnull()]
tempmed = temp[var].median()
return tempmed
We will look at insulin first to find the medians for those who tested positive and those who tested negative. As we can see from the plot labeled 'Insulin of a diabetic person', most patients fall in the range between 0 and 650.
Comparatively, the insulin of a healthy person is typically between 0 and 500, a smaller range of values. It also peaks around 100 and tapers off much faster than the positive results.
fig, ax = plot.subplots(2,1, figsize=(40,50))
sns.set(font_scale=5)
sns.distplot(positive['Insulin'], bins = 20, color = 'red', \
ax=ax[0]).set(title='Insulin of a diabetic person', xlabel = 'Insulin')
sns.distplot(negative['Insulin'], bins = 20, color = 'blue', \
ax=ax[1]).set(title='Insulin of a healthy person', xlabel = 'Insulin')
insulin_median = median_found(positive, 'Insulin')
print("positive diagnosed median:")
print(insulin_median)
positive diagnosed median: 165.0
insulin_median = median_found(negative, 'Insulin')
print("negative diagnosed median:")
print(insulin_median)
negative diagnosed median: 100.0
As we can see, there is a drastic difference in insulin between a healthy and an unhealthy person: the healthy (negative) median is 100.0, while the diabetic (positive) median is 165.0.
Now we will median fill the missing values with their respective group (Diabetic or Healthy).
attributes.loc[(attributes['Outcome'] == 0 ) & (attributes['Insulin'].isnull()), 'Insulin'] = 100.0
attributes.loc[(attributes['Outcome'] == 1 ) & (attributes['Insulin'].isnull()), 'Insulin'] = 165.0
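The two per-group fills above can also be written in one pass with pandas groupby/transform, which fills each group's NaNs with that group's own median instead of hard-coding the numbers. A sketch on toy data:

```python
import numpy as np
import pandas as pd

# Fill each Outcome group's NaNs with that group's own median,
# in one pass instead of two hard-coded .loc assignments.
# Toy data for illustration, not the real diabetes frame.
df = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Insulin": [90.0, np.nan, 110.0, 160.0, np.nan, 170.0],
})
df["Insulin"] = df.groupby("Outcome")["Insulin"].transform(
    lambda col: col.fillna(col.median())
)
print(df["Insulin"].tolist())
```

The same one-liner then covers Glucose, BloodPressure, SkinThickness, and BMI without recomputing and retyping each median by hand.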
Now we need to move on to other attributes with missing values and do the same for them.
Below we have plotted the glucose data in the same manner as the insulin data. The diabetic glucose distribution is fairly uniform between 75 and 200; it peaks at 125 and tapers off slowly. Comparatively, the glucose of a healthy person peaks at 100 and tapers off quickly.
fig, ax = plot.subplots(2,1, figsize=(40,50))
sns.set(font_scale=5)
sns.distplot(positive['Glucose'], bins = 20, color = 'red', \
ax=ax[0]).set(title='Glucose of a diabetic person', xlabel = 'Glucose')
sns.distplot(negative['Glucose'], bins = 20, color = 'blue', \
ax=ax[1]).set(title='Glucose of a healthy person', xlabel = 'Glucose')
insulin_median = median_found(positive, 'Glucose')
print("positive diagnosed median:")
print(insulin_median)
positive diagnosed median: 138.0
insulin_median = median_found(negative, 'Glucose')
print("negative diagnosed median:")
print(insulin_median)
negative diagnosed median: 107.0
attributes.loc[(attributes['Outcome'] == 0 ) & (attributes['Glucose'].isnull()), 'Glucose'] = 107.0
attributes.loc[(attributes['Outcome'] == 1 ) & (attributes['Glucose'].isnull()), 'Glucose'] = 138.0
We will now move on to Blood Pressure.
fig, ax = plot.subplots(2,1, figsize=(40,50))
sns.set(font_scale=5)
sns.distplot(positive['BloodPressure'], bins = 20, color = 'red', \
ax=ax[0]).set(title='BloodPressure of a diabetic person', xlabel = 'BloodPressure')
sns.distplot(negative['BloodPressure'], bins = 20, color = 'blue', \
ax=ax[1]).set(title='BloodPressure of a healthy person', xlabel = 'BloodPressure')
insulin_median = median_found(positive, 'BloodPressure')
print("positive diagnosed median:")
print(insulin_median)
positive diagnosed median: 76.0
insulin_median = median_found(negative, 'BloodPressure')
print("negative diagnosed median:")
print(insulin_median)
negative diagnosed median: 70.0
attributes.loc[(attributes['Outcome'] == 0 ) & (attributes['BloodPressure'].isnull()), 'BloodPressure'] = 70.0
attributes.loc[(attributes['Outcome'] == 1 ) & (attributes['BloodPressure'].isnull()), 'BloodPressure'] = 76.0
fig, ax = plot.subplots(2,1, figsize=(40,50))
sns.set(font_scale=5)
sns.distplot(positive['SkinThickness'], bins = 20, color = 'red', \
ax=ax[0]).set(title='SkinThickness of a diabetic person', xlabel = 'SkinThickness')
sns.distplot(negative['SkinThickness'], bins = 20, color = 'blue', \
ax=ax[1]).set(title='SkinThickness of a healthy person', xlabel = 'SkinThickness')
insulin_median = median_found(positive, 'SkinThickness')
print("positive diagnosed median:")
print(insulin_median)
positive diagnosed median: 32.0
insulin_median = median_found(negative, 'SkinThickness')
print("negative diagnosed median:")
print(insulin_median)
negative diagnosed median: 27.0
attributes.loc[(attributes['Outcome'] == 0 ) & (attributes['SkinThickness'].isnull()), 'SkinThickness'] = 27.0
attributes.loc[(attributes['Outcome'] == 1 ) & (attributes['SkinThickness'].isnull()), 'SkinThickness'] = 32.0
fig, ax = plot.subplots(2,1, figsize=(40,50))
sns.set(font_scale=5)
sns.distplot(positive['BMI'], bins = 20, color = 'red', \
ax=ax[0]).set(title='BMI of a diabetic person', xlabel = 'BMI')
sns.distplot(negative['BMI'], bins = 20, color = 'blue', \
ax=ax[1]).set(title='BMI of a healthy person', xlabel = 'BMI')
insulin_median = median_found(positive, 'BMI')
print("positive diagnosed median:")
print(insulin_median)
positive diagnosed median: 34.2
insulin_median = median_found(negative, 'BMI')
print("negative diagnosed median:")
print(insulin_median)
negative diagnosed median: 30.4
attributes.loc[(attributes['Outcome'] == 0 ) & (attributes['BMI'].isnull()), 'BMI'] = 30.4
attributes.loc[(attributes['Outcome'] == 1 ) & (attributes['BMI'].isnull()), 'BMI'] = 34.2
Let's take a quick look at the correlation matrix to find the strongest correlations.
sns.set(font_scale=1)
corr = attributes.corr()
sns.heatmap(corr, annot = True)
sns.pairplot(attributes, hue='Outcome')
The highest correlation is between skin thickness and BMI. This is a moderate correlation; the next closest is between age and pregnancies.
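Rather than reading the strongest pair off the heatmap by eye, the correlation matrix can be queried programmatically. A sketch on a toy frame (hypothetical numbers, chosen so SkinThickness and BMI track each other as they do in the real data):

```python
import numpy as np
import pandas as pd

# Pull the strongest pairwise correlation out of the matrix
# programmatically. Toy frame standing in for the diabetes data.
df = pd.DataFrame({
    "SkinThickness": [10, 20, 30, 40, 50],
    "BMI": [22, 26, 31, 35, 40],
    "Age": [21, 45, 30, 60, 25],
})
corr = df.corr().abs()
# Keep only the strict upper triangle so each pair appears once
# and self-correlations (always 1.0) are excluded.
mask = ~np.tril(np.ones(corr.shape, dtype=bool))
pairs = corr.where(mask).stack()
strongest = pairs.idxmax()
print(strongest, round(pairs.max(), 3))
```

On the real frame, `attributes.corr()` plugs straight into the same triangle-and-stack pattern.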
In this part we will finally implement models that can predict diabetic cases from medical records. Machine learning has many uses, including classifying data and fitting regressions. We will not be using regression, as it is unneeded for our project, but classification of the data is essential.
We will be using Linear Discriminant Analysis, a Random Forest classifier, and a Decision Tree classifier.
All learning models need to be trained, so we will use an 80/20 train/test split. We will allocate the Outcome attribute as the expected output and the rest as predictors.
X = attributes[['Pregnancies', 'Glucose', 'BloodPressure', 'Insulin', 'BMI', 'DiabetesPedigreeFunction','Age']]
y = attributes['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(550, 7)
(550,)
(138, 7)
(138,)
Linear Discriminant Analysis (LDA) is a linear model for classification and dimensionality reduction. It is most often used for feature extraction in pattern classification problems, which suits our analysis well.
LDA is one of the most popular classification models and performs very well at binary classification. It has some shortcomings: linear decision boundaries can be ineffective at separating classes that are not linearly separable, where more flexible boundaries are desired.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 15)
model = LinearDiscriminantAnalysis()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy for LDA by train/test split:")
print(accuracy_score(y_test, y_pred))
cv = KFold(n_splits=10,random_state= 1, shuffle = True)
cv_results = cross_val_score(model, X, y, cv = cv, scoring = 'accuracy', n_jobs=-1)
print("Accuracy for LDA using 10-fold cross validation:")
print(cv_results.mean())
Accuracy for LDA by train/test split:
0.8115942028985508
Accuracy for LDA using 10-fold cross validation:
0.7921355498721228
We can see that both the train/test accuracy score and 10-fold cross-validation put LDA's accuracy at around 80%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 10)
rfc = RandomForestClassifier(n_estimators=100,max_features='sqrt')
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
rfc.score(X_test, y_test)
0.9130434782608695
mat = confusion_matrix(y_test, y_pred)
plot.figure(figsize=(7, 5))
sns.heatmap(mat, annot=True)
<AxesSubplot:>
As we can see, the confusion matrix shows fairly good classification. Out of 97 total non-diabetics in our test set, we correctly classified 92 and misclassified 5. Out of the 41 diabetic cases, 34 were correctly classified while 7 were not. As seen below, the overall accuracy is around 91%, which is very strong.
target_names = ['Normal', 'Diabetes']
print(classification_report(y_test, y_pred, target_names=target_names))
cv = KFold(n_splits=10,random_state= 1, shuffle = True)
cv_results = cross_val_score(rfc, X, y, cv = cv, scoring = 'accuracy', n_jobs=-1)
print("Accuracy for Random Forest using 10-fold cross validation:")
print(cv_results.mean())
              precision    recall  f1-score   support

      Normal       0.93      0.95      0.94        97
    Diabetes       0.87      0.83      0.85        41

    accuracy                           0.91       138
   macro avg       0.90      0.89      0.89       138
weighted avg       0.91      0.91      0.91       138
Accuracy for Random Forest using 10-fold cross validation:
0.876470588235294
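Random forests also expose how much each predictor contributed via `feature_importances_`. A sketch on synthetic data standing in for our diabetes split; on the real data, the same lines report one importance per column of X:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 7-feature data standing in for our predictor matrix X.
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

# One importance per feature; they are non-negative and sum to 1.
importances = rf.feature_importances_
print(importances)
```

Pairing these values with the column names of X would show which attributes (for example, Glucose) drive the forest's predictions.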
For this learning algorithm we will apply all attributes to the model. In general, a decision tree consists of many paths of nodes that originate from the root of the tree. These paths pass through different nodes, at each of which a decision is made that determines where to go next.
In our case, we want to classify whether a person does or doesn't have diabetes, so we want the tree to split the inputted data repeatedly in order to correctly predict the category a data point falls under. We will not be using a regression tree, as we do not need to predict a value for an attribute but simply to determine whether a certain set of values means diabetes or not.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 0)
dtc = DecisionTreeClassifier(max_depth = 4, random_state = 0)
dtc = dtc.fit(X_train,y_train)
# now predict the value
y_hat = dtc.predict(X_test)
accuracy_score(y_hat,y_test)
0.8623188405797102
#display decision tree
tree.plot_tree(dtc)
[Decision tree plot: each node shows its split condition, gini impurity, sample count, and per-class counts.]
As you can see, this decision tree results in about 86 percent accuracy. We used a max depth of 4, which seems optimal; as seen below, higher depths appear to overfit the data.
dtc = DecisionTreeClassifier(max_depth = 7, random_state = 0)
dtc = dtc.fit(X_train,y_train)
# now predict the value
y_hat = dtc.predict(X_test)
accuracy_score(y_hat,y_test)
0.8478260869565217
#display decision tree
tree.plot_tree(dtc)
[Decision tree plot for max depth 7: the deeper tree produces many small, near-pure leaves, a sign of overfitting.]
93\nvalue = [6, 87]'), Text(219.54098360655738, 67.94999999999999, 'X[6] <= 39.5\ngini = 0.463\nsamples = 11\nvalue = [4, 7]'), Text(214.05245901639344, 40.77000000000001, 'X[5] <= 0.497\ngini = 0.444\nsamples = 6\nvalue = [4, 2]'), Text(208.5639344262295, 13.590000000000003, 'gini = 0.0\nsamples = 4\nvalue = [4, 0]'), Text(219.54098360655738, 13.590000000000003, 'gini = 0.0\nsamples = 2\nvalue = [0, 2]'), Text(225.0295081967213, 40.77000000000001, 'gini = 0.0\nsamples = 5\nvalue = [0, 5]'), Text(241.4950819672131, 67.94999999999999, 'X[3] <= 168.75\ngini = 0.048\nsamples = 82\nvalue = [2, 80]'), Text(236.00655737704918, 40.77000000000001, 'X[3] <= 166.5\ngini = 0.231\nsamples = 15\nvalue = [2, 13]'), Text(230.51803278688524, 13.590000000000003, 'gini = 0.142\nsamples = 13\nvalue = [1, 12]'), Text(241.4950819672131, 13.590000000000003, 'gini = 0.5\nsamples = 2\nvalue = [1, 1]'), Text(246.98360655737704, 40.77000000000001, 'gini = 0.0\nsamples = 67\nvalue = [0, 67]'), Text(290.89180327868854, 95.13, 'X[0] <= 5.5\ngini = 0.381\nsamples = 43\nvalue = [11, 32]'), Text(268.9377049180328, 67.94999999999999, 'X[4] <= 46.3\ngini = 0.165\nsamples = 22\nvalue = [2, 20]'), Text(257.96065573770494, 40.77000000000001, 'X[5] <= 0.332\ngini = 0.095\nsamples = 20\nvalue = [1, 19]'), Text(252.47213114754098, 13.590000000000003, 'gini = 0.375\nsamples = 4\nvalue = [1, 3]'), Text(263.4491803278689, 13.590000000000003, 'gini = 0.0\nsamples = 16\nvalue = [0, 16]'), Text(279.9147540983607, 40.77000000000001, 'X[6] <= 44.5\ngini = 0.5\nsamples = 2\nvalue = [1, 1]'), Text(274.42622950819674, 13.590000000000003, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]'), Text(285.4032786885246, 13.590000000000003, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]'), Text(312.8459016393443, 67.94999999999999, 'X[4] <= 35.65\ngini = 0.49\nsamples = 21\nvalue = [9, 12]'), Text(301.8688524590164, 40.77000000000001, 'X[1] <= 165.0\ngini = 0.397\nsamples = 11\nvalue = [8, 3]'), Text(296.3803278688525, 
13.590000000000003, 'gini = 0.198\nsamples = 9\nvalue = [8, 1]'), Text(307.35737704918034, 13.590000000000003, 'gini = 0.0\nsamples = 2\nvalue = [0, 2]'), Text(323.82295081967214, 40.77000000000001, 'X[3] <= 191.0\ngini = 0.18\nsamples = 10\nvalue = [1, 9]'), Text(318.3344262295082, 13.590000000000003, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]'), Text(329.3114754098361, 13.590000000000003, 'gini = 0.0\nsamples = 9\nvalue = [0, 9]')]
After viewing the data from the graphs, we can safely determine that higher glucose values are a strong indicator of diabetes. The other attributes matter as well, but not to the same degree as Glucose. We have also seen that once Insulin rises past a certain point, there is a strong likelihood of being diabetic. Other good indicators are a high number of pregnancies and higher age.
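One quick way to quantify these observations is a random forest's impurity-based feature importances. The sketch below is illustrative only: it reuses the Pima column names, but trains on synthetic data from `make_classification` as a stand-in, so the importance ranking it prints will not match the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in shaped like our data: 8 features, binary outcome.
X, y = make_classification(n_samples=700, n_features=8,
                           n_informative=4, random_state=0)
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance (the scores sum to 1.0).
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name:25s} {score:.3f}")
```

On the real dataset, we would expect Glucose near the top of this ranking, consistent with what the graphs showed.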
As we can see, random forest is by far the most accurate model, with the decision tree the next highest. Our dataset was not very large (it contained roughly 700 data points), so while these results are encouraging, the model should be validated on larger and more diverse data before it is used to assess a person's likelihood of being diabetic in the real world.
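A more robust way to compare the two models than a single train/test split is k-fold cross-validation, which averages accuracy over several folds. This is a minimal sketch of that comparison; it again uses synthetic `make_classification` data as a stand-in for our DataFrame, so the exact scores will differ from ours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in sized like our ~700-row dataset.
X, y = make_classification(n_samples=700, n_features=8,
                           n_informative=4, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("Decision tree", DecisionTreeClassifier(random_state=0)),
                    ("Random forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)  # accuracy per fold
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and standard deviation across folds makes it clear whether the random forest's advantage is consistent or just an artifact of one particular split.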